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Abstract 

Background: Certain post-translational modifications to histones, including H3K4me3, as well as binding sites for 
the transcription factor STAT1, predict the site of integration of exogenous gamma-retroviruses with great accuracy 
and cell-type specificity. Statistical methods that were used to identify chromatin features that predict exogenous 
gamma-retrovirus integration site selection were exploited here to determine whether cell type-specific chromatin 
markers are enriched in the vicinity of endogenous retroviruses (ERVs). 

Results: Among retro-elements in the human genome, the gamma-retrovirus HERV-H was highly associated with 
H3K4me3, though this association was only observed in embryonic stem (ES) cells (p < 10" 300 ) and, to a lesser 
extent, in induced pluripotent stem (iPS) cells. No significant association was observed in nearly 40 differentiated 
cell types, nor was any association observed with other retro-elements. Similar strong association was observed 
between HERV-H and the binding sites within ES cells for the pluripotency transcription factors NANOG, 0CT4, and 
S0X2. NANOG binding sites were located within the HERV-H 5'LTR itself. 0CT4 and S0X2 binding sites were within 
1 kB and 2 kB of the 5'LTR, respectively. In keeping with these observations, HERV-H RNA constituted 2% of all poly 
A RNA in ES cells. As ES cells progressed down a differentiation pathway, the levels of HERV-H RNA decreased 
progressively. RNA-Seq datasets showed HERV-H transcripts to be over 5 kB in length and to have the structure 
S'nR-gag-pfo-3'nR, with no evidence of splicing and no intact open reading frames. 

Conclusion: The developmental regulation of HERV-H expression, the association of HERV-H with binding sites for 
pluripotency transcription factors, and the extremely high levels of HERV-H RNA in human ES cells suggest that 
HERV-H contributes to pluripotency in human cells. Proximity of HERV-H to binding sites for pluripotency 
transcription factors within ES cells might be due to retention of the same chromatin features that determined the 
site of integration of the ancestral, exogenous, gamma-retrovirus that gave rise to HERV-H in the distant past. 
Retention of these markers, or, alternatively, recruitment of them to the site of the established provirus, may have 
acted post-integration to fix the provirus within the germ-line of the host species. Either way, HERV-H RNA provides 
a specific marker for pluripotency in human cells. 

Keywords: HERV-H, Endogenous retrovirus, Pluripotency, Long non-coding RNA, Embryonic stem cell, Induced 
pluripotent stem cell 



Background 

Vertebrate genomes contain retroviral sequences that are 
believed to be remnants of exogenous retroviral infection 
from the distant past [1]. The genesis of these endogenous 
retroviruses (ERVs) necessitates establishment of provirus 
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by the ancestral, exogenous retrovirus within host germ 
cells, such that these elements are maintained as heritable 
genetic elements in the host species. Human endogenous 
retroviruses (HERVs) are not uncommon, and account for 
at least 8% of the total human DNA. By comparison with 
DNA sequences from exogenous retrovirus families, three 
main classes of HERVs, gamma, beta, and delta have been 
identified [2]. Phylogenetic analyses identified HERV-K, a 
betaretrovirus, and HERV-H, a gammaretrovirus, as the 
most recent entries into the genomes of primates, 10 and 
25 million years ago, respectively [3,4]. 
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Though some HERVs are transcriptionally active, most 
retroviral sequences in the human genome are corrupted 
by mutations or successive insertion of transposable ele- 
ments; most HERVs lack intact open reading frames for 
viral protein production and no autonomously replicating 
HERV has been identified. As a result, HERVs are gener- 
ally considered non-functional, junk DNA. In some cases, 
though, HERVs make significant - even essential - contri- 
butions to the normal physiology of the host species [5-8]. 
HERVs have also been implicated in the development of 
pathological conditions [9-14]. Interestingly, HERVs can 
be activated by exogenous retrovirus infection [15] and, in 
reciprocal fashion, the immune response to these elements 
can influence the outcome of infection by pathogenic, ex- 
ogenous retroviruses such as HIV-1 [16,17]. 

Exogenous retroviruses integrate into locations through- 
out the host chromosomal DNA in a quasi-random fashion 
(reviewed extensively in [18]). Previously, we developed 
statistical tools to identify association between the sites of 
provirus establishment and chromosome marks - as deter- 
mined by chromatin immuno-precipitation with massively 
parallel DNA sequencing (ChlPSeq) [18,19]. From this 
analysis we identified chromatin modifications (H3K4me3, 
H3K4mel, and H3K9ac) and binding sites for transcription 
factors (STAT1) that predict the site of integration for 
gamma-retroviruses with great accuracy in a cell-type spe- 
cific manner [18,19]. Precise markers such as these were 
not identified for other classes of retrovirus such as the 
lentivirus HIV-1 [18]. Here, the same statistical methods 
were exploited to determine whether any ERVs retain asso- 
ciation with cell type-specific chromatin features that 
might have determined the site of integration in the distant 
past, or that served to fix the provirus in the genetic patri- 
mony of the host species. 

Results 

Search for chromatin features near the site of ERVs 

Previously observed, high-level association between 
gamma-retrovirus integration sites and particular epi- 
genetic markers [18] prompted a quest to find associ- 
ation between any known endogenous retroviral element 
in the human and mouse genomes, and the cell-type 
specific localization of particular chromosomal features. 
For this purpose, ChlPSeq profiles were evaluated from 
more than 40 different human and mouse cell types, in- 
cluding ES cells, iPS cells, monocytes, HeLa cells, CD4+ 
T cells, and CD34 + hematopoietic cells (Table 1). 
H3K4me3 was used in the initial analysis because of the 
availability of ChlPSeq datasets from a large number of 
cell types for this marker. The ERV dataset was com- 
piled using all endogenous, LTR-containing elements 
annotated via RepeatMasker on the human reference 
genome hgl8 (UCSC) or the mouse reference genome 
mm9 (UCSC). 



Figure 1 shows this analysis schematically. The block 
Associator is a computational module fed by ChlPSeq 
profiles and LTR loci. Among all LTRs, only HERV-H 
showed a significant association with H3K4me3, though 
this association was only with some human cell types; the 
location of the endogenous gamma-retrovirus was asso- 
ciated with H3K4me3 profiles in ES cells (F-score >0.8; 
p < 10" 300 , by Fisher exact test) and, to a lesser extent, 
in iPS cells (F-score >0.7; p < 10" 100 ). F-scores > 0.5 are 
considered significant with 1.0 maximal [18], so these 
values are highly significant. In contrast, no retroviral 
element in the mouse was found to be associated with 
H3K4me3. 

The data were assessed by means of Hierarchical Clus- 
tering. Each association profile was calculated as a func- 
tion of the distance between HERV-H and the nearest 
H3K4me3 marker, discretized with steps of 0.5 kB. The 
specific cells are listed on the x axis of Figure 2, and the 
window size in kilobases is shown on the y-axis. The 
Euclidean distance between these profiles was used to 
discriminate among clusters. The algorithm identified 
three main clusters of high, medium and low association 
(represented in red, green and blue in Figure 3). The 
cluster with the highest association (red) was populated 
by ES cells (HI, H9 and 13) and iPS cells (iPS-15b and 
iPS-lla), with a mean F score of 0.74. The cluster with 
medium association (green) was populated by bone mar- 
row mesenchymal stem cells (DMSC), breast cancer cells 
(vHMEC), fetal lung cells and fetal brain cells; it had F 
scores that were just barely significant. Some iPS cells 
(A6, PDBllox, 18c, CI) fall within this cluster, thus con- 
firming that many iPS cells have epigenetic profiles dis- 
tinct from those of ES cells [29]. The blue cluster 
consisted of differentiated cells, including CD4 + T cells, 
fibroblasts, pancreatic islet cells, and HeLa cells; the 
mean F score of 0.36 was insignificant. The association 
of HERV-H with the post-translational histone modifica- 
tion H3K4me3 correlated strongly with the degree of cell 
differentiation, a relationship that was clearly visible 
using chromosome projection mandalas (Figure 4). 

HERV-H expression in ES cells 

Given the remarkable, pluripotent stem cell-specific associ- 
ation of HERV-H with H3K4me3, a marker for transcrip- 
tionally active promoters [30,31], HERV-H expression 
would be expected to be higher in ES and iPS cells than in 
differentiated cell types. Consistent with this possibility, 
HERV-H was not associated with H3K27me3 (F score 0.2), 
a marker for transcriptional repression in ES cells [32]. EN- 
CODE Project RNA-seq data sets [33] of paired 75 nucleo- 
tide reads from human ES cells and 6 differentiated cell 
types were assessed for HERV-H RNA peaks. In HI human 
ES cells, HERV-H RNA accounted for 2% of the total 
RNA. This extraordinary level of HERV-H expression was 
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Table 1 Data used for analysis of endogenous retroviruses 



Data type GEO accession number Factor 



Cell type or tissue 



Reference 



Human 

ChlPSeq, GSE16256 
ChlPSeq, GSE22499 
ChlPSeq, GSE15353 
ChlPSeq, GSE19465 

ChlPSeq, GSE20650 
ChlPSeq, GSE18292 
RNASeq, GSE23316 
RNASeq, GSE20301 
Mouse 

ChlPSeq, GSE22075 
ChlPSeq, GSE12241 



H3K4me3, H3K27me3 
H3K4me3 
H3K4me3 
H3K4me3 

NAN0G,0CT4, KLF4 
S0X2 



Mouse H3K4me3 
Mouse H3K4me3 



HI, H9, 13, IPS, IMR90 

BG01/03, WIBR1/2/7, hiPSA6/C1, Fibroblast 
HeLa w/o IFN 

hiPS-1 1a/18c/15b/20b, duodenum mucosa, BM-MSC,smooth 
muscle, adult liver, fetal lung, fetal brain, CD34, CD3, CD15, 
CD19, Pancreatic Islets 

H1 

H1 

HI, HeLa, K562 

H1, differentiated H1 

ES, LSK cells 
ES, MEF 



[20] 
[21] 
[22] 
[23] 

[24] 
[25] 

UCSC ENCODE Project 
[26] 

[27] 
[28] 



confirmed with RNA-seq data [34] from H9, another ES 
cell line. For comparison, expression of the younger and 
better-conserved beta-retrovirus HERV-K was 1000-fold 
lower than HERV-H in HI ES cells. HERV-H RNA was 
100-lower in HeLa cells and more than 100-fold lower 
in K562 myelogenous leukemia cells, GSM12878 lympho- 
blastoid cells, HepG2 hepatocellular liver carcinoma cells, 
human umbilical vein endothelial cells (HUVEC), and 
NHEK epidermal keratinocytes. BRD2, a gene that is 
expressed at nearly the same level in all these cell types 
was 25-fold lower in expression than HERV-H in ES cells 
(Figure 5). 

Almost all the HERV-H RNA expressed in ES cells 
had the structure ^LTR-gag-pro-^LTR with deletion of 
the pol and env regions, and no intact open reading 
frames. HERV-H fragments containing pol and env 
sequences were barely detected (Figure 6). 

It has been reported that human ES cell DNA is hypo- 
methylated with respect to differentiated cell lines, and 
that this global effect releases all endogenous elements, 
such as SINEs, LINEs, and HERVs, from transcriptional 
silencing [35]. To determine if HERV-H expression in 
ES cells is simply a result of this global trend, the RNA- 
seq data was used to compare the expression level of all 
repetitive elements in ES cells. Compared to LINEs and 
SINEs, HERV-H was by far the most expressed repetitive 
element in human ES cells, accounting for nearly all 
transcription of the HERV family (Figure 7). 

It is also important to consider that expression of en- 
dogenous retro-elements might be influenced by local 
effects from adjacent transcriptional units. The differ- 
ence between the expression level of the repetitive ele- 
ments and the expression level of the surrounding 
sequences was assessed (see Methods for details). By this 
measure, HERV-H exhibited transcriptional specificity 



comparable to that of conventional genes, in terms of 
specific/unspecific transcription ratio (0.85 vs. 0.86), 
while LINEs and SINEs had a transcription ratio of 0.06 
and 0.03, respectively (Figure 7). 

HERV-H expression over the course of human ES cell 
differentiation 

The disparity between the high transcriptional level of 
HERV-H in human ES cells, and the near absence of 
HERV-H transcription in differentiated cells prompted an 
assessment of HERV-H expression as ES cells differentiate. 
To accomplish this, raw data was analyzed from a 4-point 
time course experiment (http://www.ncbi.nlm.nih.gov/geo/ 
query/acc.cgi?acc=GSE20301) in which RNA was collected 
from human ES cells in the undifferentiated state (NO), early 
initiation (Nl), neural progenitor (N2), and pre-glial cell 
(N3) stages [26]. High HERV-H transcriptional levels were 
confirmed in the undifferentiated NO stage, and HERV-H 
levels decreased progressively with a magnitude that corre- 
lated with the differentiation time-point (Figure 8). 

HERV-H and pluripotency transcription factors 

When ectopically expressed in particular combinations, 
the transcription factors NANOG, OCT4, and SOX2 are 
capable of reprogramming mature somatic cells into pluri- 
potent stem cells [36-39]. Conversely, the expression of 
these factors decreases as cells differentiate [40]. The 
RNA-seq data sets utilized above that measure the dynam- 
ics of RNA in ES cells as they differentiate into neural pro- 
genitors was analyzed for expression of HERV-H and for 
these transcription factors. HERV-H RNA levels correlated 
well with those of NANOG and OCT4 (Figure 9). SOX2 
was more stably expressed during differentiation than was 
NANOG or OCT4 and its expression did not correlate 
with that of HERV-H. 
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Figure 1 Schematic showing the strategy for detecting 
association between endogenous retro-elements and the 
H3K4me3 marker. H3K4me3 CHiPSeq data for more than 40 
different cell types were checked for association within 2 kB of all 
annotated endogenous retroviral elements (hg 1 8) using methods 
described in reference [18]. Only HERV-H proviruses showed 
association with H3K4me3, and this was only significant in human 
ES cell lines. No association between H3K4me3 and endogenous 
retro-elements was detected in the mouse. 



To explore the possibility that these pluripotency tran- 
scription factors are somehow involved in HERV-H 
transcription, an association analysis was conducted be- 
tween HERV-H locations and the ChlPSeq binding 
profiles of NANOG, OCT4, SOX2, and KLF4 extracted 
from ES cells [24,25] (Figure 10). Over a 2 kB window, 
NANOG showed the most significant association (F score 
0.75, p < 10" 180 ). OCT4 weakly associated with HERV-H 
(F score 0.55, p < 0.0001), while SOX2 and KLF4 were not 
associated (F score 0.46 and 0.24, respectively). 

Statistical significance increased considerably when the 
expression level of individual HERV-H proviruses was 
monitored (Figure 11A and B). As visualized on the 
chromosome projection mandala in Figure 11C, NANOG 
binds to the LTRs of 96% of the 50 most highly expressed 
HERV-H proviruses. These 50 proviruses account for 80% 
of all HERV-H RNA. The LTR region corresponding to 
the first 0.5 kB of HERV-H proviral sequence is high- 
lighted in green, with dot sizes proportional to expression 
levels. With a window of 0.5 kB, the F score for NANOG 
was nearly perfect (0.97, p < 10~ 300 ). Limited to this subset 
of HERV-H proviruses the F scores for OCT4 (0.90 wi2kB, 
p < 10~ 300 ) and SOX2 (0.85 wi4kB; p < 10" 300 ) were also 
tightly linked to HERV-H. KLF4, in contrast, was not asso- 
ciated with HERV-H (F score 0.34). The average distance 
of the NANOG, OCT4 and SOX2 ChlPSeq peaks from 
the HERV-H transcription start sites was 250 basepairs, 
1 kB and 2 kB, respectively, as if these factors were 
uniformly distributed along the HERV-H promoter in 




Figure 2 H3K4me3 association with HERV-H proviruses, as measured with the F score, in 40 different cell types, as a function of the 

distance from the marker. The F score is reported on the z-axis. Cell types are listed on the x-axis, while the window size is reported on the y- 

axis. Yellow to red bands indicate a significant association (F score >0.5). Significance decreases as the color shifts from red to blue. 12 out of 15 

of the cell types with significant F scores are either human ES cells or iPS cells. 
\ ) 
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Figure 3 Hierarchical cluster analysis of the HERV-H/H3K4me3 F score, as a function of the window size (horizontal axis). F score is 
represented using the color scale as in Figure 2. The highly associated cluster (red tree) was made of human ES and iPS cells. The cluster with 
medium association (green) consisted of mesenchymal stem cells (DMSC), fetal lung and brain cells, and some iPS cells, with F scores that were 
barely significant. The blue cluster consisted of differentiated cells and the mean F score of 0.36 was insignificant. 




human ES cells (Figure 11B). Indeed, for 80% of the highly 
expressed HERV-H proviruses NANOG, OCT4 and 
SOX2 map in this order. 



immediately adjacent to the transcribed region, confirm- 
ing that HERV-H expression is not a non-specific conse- 
quence of hypomethylation in ES cells. 



Association of LINE elements with NANOG in human ES 
cells 

As a matter of comparison, a chromosome projection 
mandala was created in which NANOG binding sites in 
ES cells were compared with the location of the LINEs 
(Figure 12). Only one LINE, located on chromosome 13 
between nucleotides 55,048,158 and 55,050,476, was 
expressed to an appreciable level. Interestingly, a deeper 
inspection of this region revealed a particular genomic 
structure where LINE transcription was driven by an ad- 
jacent HERV-H, bounded by NANOG binding sites at 
both LTRs. This element does not have an intact ORF, 
and its functional role is unclear. It is also noteworthy 
that the adjacent HERV-L is not expressed, even if it is 



Discussion 

This study discovered a strong association between the 
genomic location of HERV-H proviruses and H3K4me3- 
modified histones, that is exclusive to human ES and 
some iPS cells. Consistent with H3K4me3 serving as 
a marker for active transcription, HERV-H expression 
was high in these pluripotent stem cells. Moreover, 
the pluripotency transcription factors NANOG, OCT4 
and SOX2 bound to the LTRs of transcriptionally ac- 
tive HERV-H proviruses, or within 2 kB of them. 
NANOG, OCT4 and SOX2 frequently co-occupy the 
promoters of target genes, many of which are tran- 
scription factors that regulate development such as 
homeodomain proteins [41]. 
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0.82 0.74 0-28 

Figure 4 Chromosome projection mandalas showing the proximity of each HERV-H provirus to the nearest site of H3K4Me3 on the 
chromosome, in human 13 ES cells, iPS-15b cells, and HeLa cells. Each dot on the mandala indicates an HERV-H provirus, as described in 
reference [18]. The angular distance around the mandala indicates the linear position of each provirus on the indicated chromosome. The radial 
distance from the perimeter indicates the distance of the provirus from the nearest H3K4Me3 site, in log scale from 0 to 1 megabase. Blue dots 
are HERV-H proviruses within 2 kB from the nearest marker. Red dots are proviruses >2 kB away from the nearest H3K4Me3 site. The association 
strength (F score) is written under each Mandala. F score > 0.5 constitutes a significant association. 



These observations strongly support the hypothesis that 
HERV-H transcripts play a role in human pluripotency and 
that this role is finely regulated by three of the most import- 
ant transcription factors in ES cells. In addition to the bind- 
ing of NANOG, OCT4, and SOX2 to the HERV-H 
promoter, HERV-H RNA decreased as ES cells differen- 
tiated, in a manner that was proportional to the expression 
of NANOG and OCT4. Conversely, HERV-H RNA was un- 
detectable in primary fibroblasts but increased enormously 
after forced re-programming to generate pluripotent stem 



cells (unpublished data provided by Audrey Letourneau 
and Stylianos Antonarakis). HERV-H, then, can be 
exploited as a reliable marker of ES cell pluripotency, as 
well as an indicator of the degree of "sternness" of iPS cells 
as they are generated from fibroblasts. 

HERV-H transcripts are 5 to 6 kB in length and lack 
open reading frames. We can only speculate about the 
function of these IncRNAs. They might, for example, serve 
to soak-up miRNAs that promote differentiation, as has 
been shown with linc-MDl in muscle differentiation [42] 




Figure 5 Cumulative expression of all HERV-H proviruses in human HI ES cells (hESC), HeLa cells, or K562 cells, compared to 
expression of HERV-K and BRD2, a constitutive gene with the same expression level in all three cell types. In human ES cells, HERV-H is 
expressed 1000-fold higher than HERV-K and 25-fold higher than BRD2. HERV-H expression is barely detectable in HeLa, and no significant HERV- 
H expression was detected in K562 cells. RNASeq data for this analysis were from reference [33]. 

V ) 
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Figure 6 Mapping of RNA-seq reads from HI human ES cells on a schematic of the HERV-H provirus. The quantity of each RNA read was 
normalized to the reads corresponding to the 5' LTR. Only RNA fragments corresponding to 5'UR-gog-pro-3'UR were expressed to a significant 
level in human ES cells. 



or the PTENP1 pseudogene in the regulation of PTEN 
and growth suppression [43]. They might bind to chroma- 
tin and act as a scaffold for the local recruitment of pluri- 
potency transcription factors, similar to other IncRNAs 
like HOTAIR for histone modification complexes [44] and 
Xist in the context of X- chromosome inactivation [44-46] . 
Alternatively, HERV-H might counteract retrovirus spread 
by interfering with packaging of retroviral genomic RNA 
[47,48] or by soaking up miRNAs that are required for 
retrovirus transduction. 



The study here failed to identify chromatin markers 
that associate with endogenous retro-elements in mice. 
This was somewhat surprising given the many endogen- 
ous retro-elements in this species, including endogenous 
gamma-retroviruses, some of which are intact and func- 
tional [8]. It was also surprising because exogenous 
gamma-retroviruses have the same integration site pre- 
ferences in mouse cells as they have in human cells [18]. 
MLV integration sites are associated with the H3K4me3 
profile in mouse embryonic fibroblasts (F score = 0.83; 
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Figure 7 HERV-H expression accounts for nearly all HERV expression in human ES cells and is a not a non-specific consequence of 
wide-spread hypo-methylation in these cells. Quantitation of the RNA-seq reads from H1 ES cells, broken down according to the LINES, SINES, 
all HERVs, the nearly 1,000 HERV-H proviruses, and conventional genes. HERV-H RNA accounted for nearly all the HERV RNA in human ES cells, 
and 2% of total RNA. Specific vs. non-specific expression was determined by comparing the expression level of each element to the surrounding 
sequences. 
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Figure 8 HERV-H expression correlates with differentiation status. HERV-H expression as H1 human ES cells differentiate down a pathway 
towards neural progenitors and early glial cells. Black bars indicate unspecific expression; white bars represent specific expression, adjusted for 
expression as described in Figure 7. BRD2 has the same expression level at each stage of differentiation and was used to normalize HERV-H RNA 
levels. 



p < 10~ 100 ). Similar results with murine hematopoietic 
stem cells (F score of 0.81; p < 10" 100 ) indicate that, as in 
human cells, the association strength is cell-type dependent. 

One possible explanation for the failure to identify chro- 
matin markers associated with endogenous murine retro- 
viruses is species-specific differences in the recruitment of 
the transcriptional silencing machinery. In murine ES 
cells, for example, a sequence-specific DNA-binding pro- 
tein, ZNF809 [49], recruits TRIM28 and other compo- 
nents of the cellular machinery that silence MLV [50]. 
ZNF809 has no orthologue in humans; perhaps ZNF809 
arose as a result of selective pressure exerted by murine 
specific gamma-retroviruses during evolution. 



Previous work demonstrated that when exogenous retro- 
viruses integrate they home to sites of H3K4me3 [18]. 
Similarly, the association of endogenous gamma-retrovirus 
HERV-H with H3K4me3 suggests that when human and 
simian germ cells were bombarded with the HERV-H an- 
cestor 15 to 30 millions years ago, these ancient retro- 
viruses integrated in proximity to H3K4me3-marked 
chromatin. These proviruses might then have retained 
these cell-type specific marks as they became fixed in the 
primate genome. Alternatively, unmethylated HERV-H 
LTRs might have recruited chromatin remodeling factors 
and induced H3K4me3 modification of the viral promoter 
after integration had occurred. 
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Figure 9 HERV-H expression levels correlate with those of pluripotency transcription factors NANOG and OCT4 as human ES cells 
move down a differentiation pathway. NO, undifferentiated ES cells. Nl, early initiation stage of differentiation. N2, neural progenitor stage. 
0CT4 and NANOG are positively correlated with HERV-H (p = 0.95, p = 0.84, respectively). SOX2 shows no correlation. 
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HERV-H / S0X2 HERV-H / KLF4 

Figure 10 Modified chromosome projection mandalas depicting NANOG, 0CT4, S0X2 and KLF4 associations with HERV-H proviruses. 

Mandalas were generated as described in reference [18] and Figure 4 above. The green ring indicates the HERV-H 5' LTR region (500 nucleotides 
from the transcriptional start site). Dot size is proportional to the expression level of each single provirus. NANOG is bound to the 5'LTR of almost 
all highly expressed retroviruses. 0CT4 shows a significant association. S0X2 binds all expressed HERV-H in a region between 1 KB and 2 KB while 
KLF4 does not show any significant association pattern. 



Analysis of the DNA surrounding HERV-H pro- 
viruses failed to clarify which of these two scenarios is 
more likely. Additionally, search for epigenetic markers 
like H3K4me3 in syntenic regions in the mouse gen- 
ome was attempted to determine if these chromatin 



marks are conserved across the species and predate the 
entry of HERV-H into the primate genome. DNA sur- 
rounding HERV-H proviruses in the human genome 
was aligned to the mouse genome (using the tool Lift- 
Over, http: 1 1 genome.ucsc.edu! cgi-bin I hgLiftOver) after 
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Figure 11 (See legend on next page.) 
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(See figure on previous page.) 

Figure 11 Ordered spacing of pluripotency transcription factors, binding to the HERV-H 5' LTR in human ES cells. (A) Association 
strength (F score) of NANOG, 0CT4 and S0X2 binding sites with the 50 most highly expressed HERV-H proviruses (accounting for 80% of total 
HERV-H expression), as a function of distance from the HERV-H transcription start site (TSS). Maxima in F score indicate the distance of greatest 
association. (B) Average distance of NANOG (red), 0CT4 (blue) and S0X2 (green) to HERV-H TSS is shown schematically. As expected from a 
uniform distribution model, the average distance is half of the distance between maximal association and TSS. (C) Chromosome Projection 
Mandala combining NANOG, 0CT4 and S0X2 with respect to 50 HERV-H proviruses. The three embryonic transcription factors bind with the 
same order (NANOG-OCT4-SOX2) to the promoter region of the most expressed HERV-Hs. 
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Figure 12 (A) Chromosome projection mandala depicting the association between NANOG and expressed LINEs in human ES cells. Only 
one LINE (large blue dot on chromosome 13) was expressed to high level in these cells. (B) The genomic region around the LINE on chromosome 13 
is shown with the UCSC Genome Browser, where the linear chromosome is mapped on the horizontal axis. The position of the adjacent HERV-H and 
HERV-L proviruses is shown. Four biological replicates confirm HERV-H and LINE expression in human ES cells while no expression was detected in 
K562 cells (only one of the 4 replicates is shown). The direction of transcription was determined by strand-specific sequencing [33]. NANOG bound to 
both LTRs (represented with black squares along the LTR row) in the adjacent HERV-H. The adjacent HERV-L was not transcribed. 



Santoni et al. Retrovirology 201 2, 9:1 1 1 
http://www.retrovirology.eom/content/9/1/1 1 1 



Page 12 of 15 



excision of all repetitive elements (performed with 
RepeatMasker, http://www.repeatmasker.org/). Nothing 
informative was found by measuring the association of 
mouse H3K4me3 with these syntenic regions and com- 
paring these values with those obtained using control 
loci. 

P300 and H3K27ac bind intra-species conserved 
regions and co-localize in embryonic-specific enhancers 
[34,51]. These markers are also associated with HERV- 
H in human ES cells, lying within 4 kB of 80% of 
HERV-H proviruses (F score 0.95; p < 10" 300 ). It seems 
unlikely that an exogenous retrovirus would be capable 
of recruiting these factors by exploiting random con- 
served regions around its integration site. This suggests 
that a pre-existing layer of epigenetic markers favored 
integration of HERV-H into particular host loci and 
that these features are still preserved millions of years 
later. 

Conclusions 

Among retroelements in the human genome, the en- 
dogenous gammaretrovirus HERV-H is extraordinary for 
its high level expression in embryonic stem cells, in 
which it makes up 2% of all polyadenylated RNA. The 
human genome has ~ 1,000 copies of HERV-H, and the 
majority of the HERV-H RNA is encoded by a subset 
of 50 of these. HERV-H expression decreases as ES cells 
lose pluripotency, to the point where its expression 
is undetectable in fibroblasts. Consistent with this 
expression pattern, HERV-H is also expressed to high 
level in many iPS cells, though expression in some iPS 
cells is more modest; this heterogeneity may reflect 
reported differences in the epigenetic profile of many 
iPS cells, when compared with ES cells [29]. This sug- 
gests, then, that HERV-H RNA offers a relatively strin- 
gent marker for human pluripotency that would be 
worth monitoring during the generation of new iPS 
lines. The HERV-H RNAs in ES cells average about 5 
kB in length and encode no protein. It, therefore, 
seems likely that HERV-H RNA contributes to pluri- 
potency by acting as a chromatin-associated structural 
element or by acting as a microRNA decoy. 

Methods 

Statistical analysis of association between chromatin 
markers and retro-elements 

Association with a given marker was defined as the 
presence of the endogenous proviral DNA within a 
fixed distance (usually 2 kilobases) from the nearest 
marker on the linear sequence of the chromosome. Un- 
like exogenous retroviruses, the endogenous virus is 
already integrated. Therefore, to restore the conditions 
before integration, the distance from the j-esim marker 
Mj with peaks in loci {m y0 , . . ., m ;N } and the i-esim 



provirus V t spanning along the loci {Vf, Vf} has been 
calculated as 

d(V h Mj) = mm(\Vi S -Mjl\Vi e -Mj\) 

ye _ ys 

= minflVf-M,!)--*-^-*- 

where V- = V[ ^ is the central locus of the provirus. 

As a control dataset, we randomly selected 100000 
genomic locations. Association strength was measured 
with the statistical method based on the F score, as pre- 
viously described in [18]. 

Formally the F^-score is defined as the p-weighted 
harmonic mean of Precision(P) and Sensitivity (R): 

Here p = 0.5 to give more weight to Precision than to 
Sensitivity. This balances type I and type II errors by 
adjusting for the high rate of False Positives inherent to 
the examination of large datasets for genome-wide bind- 
ing sites according to statistical significance (F score 
based statistics and comparison with other measures 
have been extensively discussed in [18]). 

Markers with F scores ranging between 0.5 and 1 were 
considered to be associated with endogenous integration 
sites. 

RNASeq data analysis 

HERV-H is present in more than 1000 "imperfect" cop- 
ies in the human genome and its transcripts share a 
number of short conserved regions (each around 
100 bp). Therefore, deep sequencing of those transcripts 
yields reads (25-75 bp long sequences) which perfectly 
align to several genomic loci. Indeed, multireads map- 
ping is still a challenging process [52]. The strategy 
adopted here was to perform the alignment of uniquely 
mapped single- and paired-end reads and to reassign the 
multiple-mapped reads in function of the expression 
level of the surrounding (context) region [26]. HERV-H 
expression was evaluated in term of "Reads Per Kilobase 
per Million mapped reads (RPKM)" with the standard 
formula E r = K [53] where N r is the number of 
reads mapping onto the r transcript, L r is the length (in 
kB) of the r-esim transcript and N T is the total number 
of reads 7C=10 6 

Alignments of RNASeq generated reads have been per- 
formed with a two-step procedure. First we used Bowtie 
[54] on raw data, admitting up to two mismatches for each 
alignment, and then we discriminated between unique 
mapping reads and multireads, 80% and 20% of the total 
number of reads respectively. As expected, many multi- 
reads matched with repeated elements. Ignoring them 
would have resulted in a potential underestimation of the 
expression of endogenous retroviruses and, in general, of all 
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repetitive elements. Therefore, we adopted a probabilistic 
assignment based on the amount of reads that univocally 
map onto the surrounding regions (context), as described 
in more detail in the next paragraph. This evaluation is also 
useful to establish if a repeated element is expressly tran- 
scribed or if it is part of another structure (i.e. a gene). 

Probabilistic alignment of multiple mapping reads 

The uniquely mapping read was defined as a short se- 
quence generated by high throughput sequencing that can 
be aligned to a single genomic region s. Accordingly, we 
defined the multiread as a sequence r that aligns to a set 
of M regions {Si,. . .,S M }. For each region s t the context re- 
gion is Ci = CJiCi D Si) being Q a genomic region of n 
nucleotides encompassing s t . The assumption that the 
amount of reads is proportional to the amount of actual 
mRNA implies that the set of multireads is distributed on 
the reference genome accordingly to the amount of 
uniquely mapping reads aligned to the context regions. 
Consider SRd(s) as the function giving the number of 
reads of the dataset D that map univocally to the genomic 
region s described by the tuple (chr, start, end). Therefore, 
the probability of the read r to be actually part of the 
mRNA generated from the region s t is estimated as: 

D ( \ jMg) 

Mres t )=— . 

Eventually, the set of multireads mapping to the same M 
regions is then partitioned to {Si,. . .,Sm} accordingly to P D (s t ). 

Specificity 

The first axiom of high throughput sequencing asserts that 
the number of reads aligned to a specific genomic region is 
proportional to how much RNA has been generated by this 
region within the cell. At this point it is worth observing 
that repeated element (RE) sequences might be present in 
the RNA just because they are part of longer mRNAs. 

Since we expect that elements having a specific bio- 
logical function are independently transcribed, we 
attempted to distinguish between RE specifically expressed 
with their own promoter from those that are part of 
longer RNAs. The number of reads mapped to a region s 
can be naively modeled as a linear combination of specific 
reads T(s), unspecific reads U(s) and additional zero-mean 
noise o* 2 that account for all other experimental and non- 
systematic fluctuations that can randomly influence the 
output of the sequencing process. Formally: 

91(5) = U(S) + T(s) + a 2 

where 9t(s), as before, is the function giving the number 
of reads assigned to the region s. 



Therefore, the mean number of reads in the region s 

is: 

E[T(s)] = SR(s) - U(s). 

In order to estimate U(s), it is possible to count the 
number of reads mapped to the context region c as pre- 
viously shown. Therefore we set 



and we eventually adopt the following approximation to 
correct for the non-specificity of transcription: 

E[T(s)]u<R(s) 
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